Methods for human evaluation of machine translation
Authors
Abstract
Evaluation of machine translation (MT) is a difficult task, both for humans and using automatic metrics. The main difficulty lies in the fact that there is not one single correct translation, but many alternative good translation options. MT systems are often evaluated using automatic metrics, which commonly rely on comparing a translation to only a single human reference translation. An alternative is different types of human evaluation, commonly ranking between systems, estimations of adequacy and fluency on some scale, or error analyses. We have explored four different evaluation methods on output from three different statistical MT systems. The main focus is on different types of human evaluation. We compare two conventional evaluation methods, human error analysis and automatic metrics, to two less commonly used evaluation methods based on reading comprehension and eye-tracking. These two evaluation methods are performed without the subjects seeing the source sentence. There have been few previous attempts at using reading comprehension and eye-tracking for MT evaluation. One example of a reading comprehension study is Fuji (1999), who conducted an experiment to compare English-to-Japanese MT to several versions of manual corrections of the system output. He found significant differences on reading comprehension questions between texts with large quality differences. Doherty and O’Brien (2009) is the only study we are aware of that uses eye-tracking for MT evaluation. They found that the average gaze time and fixation counts were significantly lower for sentences judged as excellent in an earlier evaluation than for sentences judged as bad. Like previous research, we find that both reading comprehension and eye-tracking can be useful for MT evaluation. The results of these methods are consistent with the other methods for comparisons between systems with a large quality difference. For systems of similar quality, however, the different evaluation methods often do not show any significant differences.
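To make concrete what "comparing a translation to only a single human reference translation" involves, below is a minimal sketch of a simplified n-gram precision score. It is not one of the metrics used in the paper (metrics such as BLEU additionally use a geometric mean and a brevity penalty), and the example sentences and the averaging over 1- and 2-grams are illustrative assumptions only.

```python
from collections import Counter

def ngram_precision(hypothesis, reference, n):
    """Fraction of hypothesis n-grams that also occur in the reference (clipped counts)."""
    hyp_ngrams = Counter(tuple(hypothesis[i:i + n]) for i in range(len(hypothesis) - n + 1))
    ref_ngrams = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    if not hyp_ngrams:
        return 0.0
    overlap = sum(min(count, ref_ngrams[ng]) for ng, count in hyp_ngrams.items())
    return overlap / sum(hyp_ngrams.values())

def simple_score(hypothesis, reference, max_n=2):
    """Average of 1- to max_n-gram precisions against a single reference translation."""
    precisions = [ngram_precision(hypothesis, reference, n) for n in range(1, max_n + 1)]
    return sum(precisions) / len(precisions)

# Hypothetical sentences; real evaluations use full test sets and established metrics.
reference = "the cat sat on the mat".split()
system_a  = "the cat sat on a mat".split()
system_b  = "a cat is sitting on the mat".split()

print("system A:", round(simple_score(system_a, reference), 3))
print("system B:", round(simple_score(system_b, reference), 3))
```

The sketch illustrates the core weakness the abstract points to: a hypothesis that happens to match the wording of the one reference scores well, while an equally good alternative translation with different wording scores poorly.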
Similar articles
The Correlation of Machine Translation Evaluation Metrics with Human Judgement on Persian Language
Machine Translation Evaluation Metrics (MTEMs) are the central core of Machine Translation (MT) engines as they are developed based on frequent evaluation. Although MTEMs are widespread today, their validity and quality for many languages is still under question. The aim of this research study was to examine the validity and assess the quality of MTEMs from Lexical Similarity set on machine tra...
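As a rough, hypothetical illustration of how the correlation between a metric and human judgement is usually quantified, the sketch below computes a Pearson correlation between invented segment-level metric scores and invented human adequacy ratings; it is not the procedure or data of the cited study.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists of scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Invented segment-level scores: one automatic metric vs. human adequacy ratings (1-5).
metric_scores  = [0.42, 0.55, 0.31, 0.76, 0.64]
human_adequacy = [3, 4, 2, 5, 4]

print("correlation:", round(pearson(metric_scores, human_adequacy), 3))
```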
A Comparative Study of English-Persian Translation of Neural Google Translation
Many studies abroad have focused on neural machine translation and almost all concluded that this method was much closer to humanistic translation than machine translation. Therefore, this paper aimed at investigating whether neural machine translation was more acceptable in English-Persian translation in comparison with machine translation. Hence, two types of text were chosen to be translated...
Improvement and Development of an English-to-Persian Computer-Assisted Translation System
In recent years, significant improvements have been achieved in statistical machine translation (SMT), but still even the best machine translation technology is far from replacing or even competing with human translators. Another way to increase the productivity of the translation process is computer-assisted translation (CAT) system. In a CAT system, the human translator begins to type the tra...
Survey of Machine Translation Evaluation
The evaluation of machine translation (MT) systems is an important and active research area. Many methods have been proposed to determine and optimize the output quality of MT systems. Because of the complexity of natural languages, it is not easy to find optimal evaluating methods. The early methods are based on human judgements. They are reliable but expensive, i.e. time-consuming and non-reu...
Linguistic-based Evaluation Criteria to identify Statistical Machine Translation Errors
Machine translation evaluation methods are highly necessary in order to analyze the performance of translation systems. Up to now, the most traditional methods are the use of automatic measures such as BLEU or the quality perception performed by native human evaluators. In order to complement these traditional procedures, the current paper presents a new human evaluation based on the expert kn...
Modern MT Systems and the Myth of Human Translation: Real World Status Quo
This paper objects to the current consensus that machine translation (MT) systems are generally inferior to human translation (HT) in terms of translation quality. In our opinion, this belief is erroneous for many reasons, the two most important being a lack of formalism in comparison methods and a certain supineness to recover from past experience. As a side effect, this paper will provide ev...